
SAA vs SAW COGS experiment#305

Draft
dandavison wants to merge 40 commits into main from saa-cogs

Conversation

@dandavison
Contributor

No description provided.

Both SAW and SAA use GenericExecutor with a simple Execute function.
SAW gets a dedicated minimal workflow registered on the existing Go worker.
Reuse existing "payload" activity registration. Drop "cogs" from names.
WorkerOptions.FlagSet() now takes a prefix parameter. The outer CLI
passes "worker-" (so users write --worker-max-concurrent-activities),
and passthrough() strips it for the subprocess. The Go worker binary
passes "" so it accepts the stripped names, matching dotnet/python/
typescript/java workers.
m.fs.IntVar(&m.ActivityPollerAutoscaleMax, prefix+"activity-poller-autoscale-max", 0, "Max for activity poller autoscaling (overrides max-concurrent-activity-pollers)")
m.fs.IntVar(&m.WorkflowPollerAutoscaleMax, prefix+"workflow-poller-autoscale-max", 0, "Max for workflow poller autoscaling (overrides max-concurrent-workflow-pollers)")
m.fs.Float64Var(&m.WorkerActivitiesPerSecond, prefix+"activities-per-second", 0, "Per-worker activity rate limit")
m.fs.BoolVar(&m.ErrOnUnimplemented, prefix+"err-on-unimplemented", false, "Fail on unimplemented actions (currently this only applies to concurrent client actions)")
Contributor Author


This addresses what appears to be a bug in omes.

New activity "payloadWithRetries" fails for N attempts then succeeds.
Both scenarios accept --option fail-for-attempts=N (default 0, no retries).
Retry backoff is 1ms with coefficient 1.0 to minimize wait time.
The Go SDK's PollActivityExecution uses a 10s default gRPC timeout
when the context has no deadline. With 9 activity retries at server-
enforced ~1s backoff, the activity takes >10s total, hitting this
limit. Pass an explicit 60s timeout context to handle.Get().
Fixes PollActivityExecution 10s default timeout bug for standalone
activity handle.Get() when context has no deadline.
The previous commit only upgraded the worker module. The starter
(scenarios/loadgen) uses the root module, which is where handle.Get()
runs and hits the 10s default gRPC timeout bug.
- Standalone activity StartToCloseTimeout 5s -> 30s (tight timeout
  caused failures at high throughput; irrelevant for COGS experiment)
- Default executor rate 1000/s -> 500/s to start conservatively
- Fix fish variable scoping bug in run-executor.fish (yq_expr was
  local to the if block, causing the placeholder image to be used)
- Add worker tuning flags and image to patch-worker.fish
- Support API key auth in deploy-omes scripts
SAA scenario options (with defaults):
  - start-to-close-timeout-seconds (30)
  - schedule-to-close-timeout-seconds (120)
  - get-timeout-seconds (120)

Also bump SAW workflow's activity StartToCloseTimeout from 5s to 30s.
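For the SAW side, the bumped activity StartToCloseTimeout corresponds to a fragment like this (a sketch assuming the standard Go SDK workflow.ActivityOptions pattern inside the workflow function):

```go
// Activity options for the SAW workflow's payload activity; the 30s
// StartToCloseTimeout replaces the too-tight 5s value.
ao := workflow.ActivityOptions{
	StartToCloseTimeout: 30 * time.Second,
}
ctx = workflow.WithActivityOptions(ctx, ao)
```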
